
where $\|\cdot\|_2$ denotes $\ell_2$ normalization, and $l$ and $h$ are the layer index and the head index.

Previous work shows that matrices constructed in this way can be regarded as specific patterns that reflect the semantic understanding of the network [226], and the patches encoded from the input images contain a high-level understanding of parts, objects, and scenes [83].

Thus, such a semantic-level distillation target guides and closely supervises the quantized ViT. The corresponding teacher matrices $\tilde{\mathbf{G}}^{l}_{\mathbf{q}_h;T}$ and $\tilde{\mathbf{G}}^{l}_{\mathbf{k}_h;T}$ are constructed in the same way from the teacher's activations.

activation. Thus, combining the original distillation loss in Eq. (2.17), the final distillation

loss is formulated as

$$
\begin{aligned}
\mathcal{L}_{\mathrm{DGD}} &= \sum_{l \in [1,L]} \sum_{h \in [1,H]} \Big( \big\| \tilde{\mathbf{G}}^{l}_{\mathbf{q}_h;T} - \tilde{\mathbf{G}}^{l}_{\mathbf{q}_h} \big\|_2 + \big\| \tilde{\mathbf{G}}^{l}_{\mathbf{k}_h;T} - \tilde{\mathbf{G}}^{l}_{\mathbf{k}_h} \big\|_2 \Big), \\
\mathcal{L}_{\mathrm{distillation}} &= \mathcal{L}_{\mathrm{dist}} + \mathcal{L}_{\mathrm{DGD}},
\end{aligned}
\tag{2.23}
$$

where $L$ and $H$ denote the number of ViT layers and attention heads, respectively. With the proposed Distribution-Guided Distillation, Q-ViT retains the query and key distributions of the full-precision counterpart (as shown in Fig. 2.7).
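For concreteness, the following PyTorch sketch computes the $\mathcal{L}_{\mathrm{DGD}}$ term of Eq. (2.23) under the assumption that each $\tilde{\mathbf{G}}$ is the $\ell_2$-normalized query (or key) Gram matrix per head; the helper names (`similarity_matrix`, `dgd_loss`) and the batched tensor shapes are illustrative rather than taken from the Q-ViT implementation.

```python
import torch
import torch.nn.functional as F


def similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """Token-to-token similarity per head from l2-normalized activations.

    x: (batch, heads, tokens, dim) query or key tensor of one layer.
    Returns a (batch, heads, tokens, tokens) similarity matrix.
    """
    x = F.normalize(x, dim=-1)        # l2 normalization along the feature dimension
    return x @ x.transpose(-2, -1)


def dgd_loss(student_qk, teacher_qk):
    """L_DGD of Eq. (2.23): l2 distance between teacher and student
    similarity matrices, summed over layers and heads, for queries and keys.

    student_qk / teacher_qk: lists over layers of (q, k) tuples,
    each tensor shaped (batch, heads, tokens, dim).
    """
    loss = 0.0
    for (q_s, k_s), (q_t, k_t) in zip(student_qk, teacher_qk):
        dq = similarity_matrix(q_t) - similarity_matrix(q_s)
        dk = similarity_matrix(k_t) - similarity_matrix(k_s)
        # per-head l2 norm of each matrix difference, summed over batch and heads
        loss = loss + dq.flatten(2).norm(dim=-1).sum()
        loss = loss + dk.flatten(2).norm(dim=-1).sum()
    return loss


# Minimal usage example: one layer, batch 2, 6 heads, 197 tokens, head dim 64.
q_s, k_s, q_t, k_t = (torch.randn(2, 6, 197, 64) for _ in range(4))
print(dgd_loss([(q_s, k_s)], [(q_t, k_t)]))
```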

Our DGD scheme first provides a distribution-aware optimization direction by processing appropriate distillation parameters. It then constructs similarity matrices to eliminate scale differences and numerical instability, thereby improving the fully quantized ViT through more accurate optimization.

2.3.5 Ablation Study

Datasets. The experiments are carried out on the ILSVRC12 ImageNet classification dataset [204]. ImageNet is challenging because of its large scale and high diversity: it contains 1000 classes, 1.2 million training images, and 50k validation images. Our experiments use the standard data augmentation described in [224].

Experimental settings. In our experiments, we initialize the weights of the quantized model with the corresponding pre-trained full-precision model. The quantized model is trained for 300 epochs with a batch size of 512 and a base learning rate of 2e-4. We do not use a warm-up scheme. We apply the LAMB [275] optimizer with the weight decay set to 0 for all experiments. Other training settings follow DeiT [224] or Swin Transformer [154]. Note that we use 8-bit quantization for the patch embedding (first) layer and the classification (last) layer, following [61].
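As a reference point, a minimal training-setup sketch matching these settings is shown below; it assumes the LAMB implementation from the third-party `torch-optimizer` package, uses a stand-in module for the quantized model, and treats the cosine schedule as an illustrative choice rather than a confirmed detail.

```python
import torch
import torch_optimizer  # third-party `torch-optimizer` package with a LAMB implementation

# Hyper-parameters from the experimental settings above.
EPOCHS = 300
BATCH_SIZE = 512
BASE_LR = 2e-4
WEIGHT_DECAY = 0.0       # weight decay is set to 0 for all experiments

# Stand-in module for the quantized ViT; in practice its weights are initialized
# from the corresponding pre-trained full-precision model.
quantized_model = torch.nn.Linear(384, 1000)

optimizer = torch_optimizer.Lamb(
    quantized_model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY
)

# No warm-up scheme is used; a cosine schedule over the full 300 epochs
# (as in DeiT) is an assumption of this sketch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
```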

Backbone. We evaluate our quantization method on two popular implementations of vision

transformers: DeiT [224] and Swin Transformer [154]. The DeiT-S, DeiT-B, Swin-T, and

Swin-S are adopted as the backbone models, whose Top-1 accuracies on the ImageNet dataset are 79.9%, 81.8%, 81.2%, and 83.2%, respectively. For a fair comparison, we utilize the

official implementation of DeiT and Swin Transformer.
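For illustration, the full-precision backbones can be obtained as follows; the `timm` model identifiers are an assumption of this sketch, and `QuantDeiT` is a hypothetical wrapper standing in for the quantized network that is initialized from the full-precision weights.

```python
import timm

# Assumed timm identifiers for the full-precision backbones.
FP_BACKBONES = {
    "DeiT-S": "deit_small_patch16_224",
    "DeiT-B": "deit_base_patch16_224",
    "Swin-T": "swin_tiny_patch4_window7_224",
    "Swin-S": "swin_small_patch4_window7_224",
}

fp_model = timm.create_model(FP_BACKBONES["DeiT-S"], pretrained=True)

# The quantized model is then initialized from these weights, e.g. by loading
# the state dict into a structurally matching quantized network
# (`QuantDeiT` is hypothetical):
# quant_model = QuantDeiT(num_bits=2)
# quant_model.load_state_dict(fp_model.state_dict(), strict=False)
```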

Table 2.1 reports quantitative results for the proposed IRM and DGD. As shown there, the fully quantized ViT baseline suffers a severe performance drop on the classification task (0.2%, 2.1%, and 11.7% with 4/3/2 bits, respectively). IRM and DGD improve

performance when used alone, and the two techniques enhance performance considerably

when combined. For example, IRM improves the 2-bit baseline by 1.7%, and DGD yields a 2.3% improvement; when IRM and DGD are combined, the improvement reaches 3.8%.

In conclusion, the two techniques reinforce each other, improving Q-ViT and narrowing the performance gap between the fully quantized ViT and its full-precision counterpart.